FastQC is a set of tests that can be used to assess the quality of RNA-seq data. At each step of an RNA-seq pipeline, the number of low-quality reads is reduced:
At each step of the RNA-seq pipeline, FastQC was run to assess the quality of the reads. First, we extract the FastQC information from the FastQC reports located at /data/biohub/ on Phoenix.
dataDir <- "/Volumes/biohub/2017_Lardelli_K97Gfs"
fastQC_0_raw <- list.files(file.path(dataDir, "0_rawData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
## Warning: package 'bindrcpp' was built under R version 3.3.2
fastQC_1_trim <-list.files(file.path(dataDir, "1_trimmedData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
fastQC_2_rRNArem <- list.files(file.path(dataDir, "2_rRNAremovedData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
fastQC_3_aligned <- list.files(file.path(dataDir, "3_alignedData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
fastQC_4_dedup <- list.files(file.path(dataDir, "4_dedupData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
fastQC_5_clean <- list.files(file.path(dataDir, "5_cleanRData", "fastqc"), pattern = ".zip", full.names = TRUE) %>% getFastqcData()
fastQC_all <- as.list(fastQC_0_raw, fastQC_1_trim, fastQC_2_rRNArem, fastQC_3_aligned, fastQC_4_dedup, fastQC_5_clean)
Summary plots can be produced to summarise the PASS, FAIL or WARNING statuses for each of the FastQC tests for each RNA-seq library:
fastqcReports::plotSummary(fastQC_0_raw)
fastqcReports::plotSummary(fastQC_1_trim)
fastqcReports::plotSummary(fastQC_2_rRNArem)
fastqcReports::plotSummary(fastQC_3_aligned)
fastqcReports::plotSummary(fastQC_4_dedup)
fastqcReports::plotSummary(fastQC_5_clean)
RAW DATA
WARNING statuses for their Per base sequence quality.WARNING statuses for their Per sequence GC content.WARNING statuses for their Sequence Length Distribution.WARNING statuses for their Sequence Duplication Levels and three of these libraries, corresponding to read 1 in library 12, and reads 1 and 2 for library 6. Interestingly, library 6 also has a significantly larger number of reads compared to the other libraries.WARNING statuses for their Overrepresented sequences.FAIL statuses for their Adapter Content, indicating that these still need to be removed.FAIL statuses for their Kmer Content.TRIMMED DATA
Per base sequence quality to PASS for all RNA-seq libraries. This is as expected, as the trimming step also includes a quality-filtering step to filter out all reads with quality score < 20.Sequence Duplication Levels status of library 12, read 1 from FAIL to WARNING.per base sequence content has a FAIL status with all samples.PASS statuses for Sequence Length Distribution changing into WARNING statuses.The number of reads lost/discarded at each step are shown in the following table.
| RNAseq_Library | Raw | Trimmed | rRNA_removed | Aligned | Deduplicated | DNA_removed |
|---|---|---|---|---|---|---|
| 1_non_mutant_K97Gfs_24mth_13_03_2014_S1_fem_R1 | 54,955,454 | 54,893,882 | 54,893,790 | 50,933,145 | 38,308,550 | 36,879,228 |
| 10_psen1K97Gfshet_6mth_10_03_2016_S1_fem_R1 | 78,164,332 | 78,117,636 | 78,117,166 | 72,377,056 | 54,263,688 | 52,047,915 |
| 11_psen1K97Gfshet_6mth_10_03_2016_S2_fem_R1 | 88,387,480 | 88,341,182 | 88,340,660 | 81,314,599 | 61,911,413 | 59,562,064 |
| 12_psen1K97Gfshet_6mth_10_03_2016_S3_fem_R1 | 76,569,314 | 76,532,984 | 76,532,406 | 71,789,420 | 53,280,588 | 51,351,547 |
| 2_non_mutant_K97Gfs_24mth_13_03_2014_S2_fem_R1 | 69,810,372 | 69,742,212 | 69,742,048 | 64,891,457 | 48,554,022 | 46,752,384 |
| 3_non_mutant_K97Gfs_24mth_13_03_2014_S3_fem_R1 | 57,292,450 | 57,227,560 | 57,227,430 | 53,481,065 | 39,742,298 | 38,238,799 |
| 4_non_mutant_K97Gfs_6mth_10_03_2016_S1_fem_R1 | 92,739,344 | 92,685,264 | 92,684,444 | 85,502,011 | 66,177,061 | 63,677,470 |
| 5_non_mutant_K97Gfs_6mth_10_03_2016_S2_fem_R1 | 78,013,106 | 77,935,126 | 77,934,538 | 73,657,946 | 56,904,193 | 54,654,502 |
| 6_non_mutant_K97Gfs_6mth_10_03_2016_S3_fem_R1 | 166,154,748 | 166,061,488 | 166,060,174 | 153,364,755 | 108,442,547 | 104,122,705 |
| 7_psen1K97Gfshet_24mth_13_03_2014_S1_fem_R1 | 54,364,124 | 54,299,138 | 54,299,006 | 50,269,951 | 38,004,152 | 36,585,600 |
| 8_psen1K97Gfshet_24mth_13_03_2014_S2_fem_R1 | 71,216,754 | 71,142,748 | 71,142,274 | 65,911,748 | 49,193,090 | 47,231,972 |
| 9_psen1K97Gfshet_24mth_13_03_2014_S3_fem_R1 | 76,720,686 | 76,631,824 | 76,631,444 | 71,210,915 | 52,995,751 | 51,048,580 |
hisat2logs_3_aligned <- list.files(file.path(dataDir, "3_alignedData", "info"), pattern = ".info", full.names = TRUE) %>%
importHisat2Logs() %>%
extract(, c("Filename", "Total_Reads", "Unique_In_Pairs", "Not_Aligned", "Alignment_Rate"))
fastqcReports::readTotals(fastQC_0_raw) %>% View
fastqcReports::plotSequenceQualities(fastQC_1_trim) + guides(colour = FALSE)
fastqcReports::plotSequenceQualities(fastQC_2_rRNArem) + guides(colour = FALSE)
fastqcReports::plotSequenceQualities(fastQC_3_aligned) + guides(colour = FALSE)
fastqcReports::plotSequenceQualities(fastQC_4_dedup) + guides(colour = FALSE)
fastqcReports::plotSequenceQualities(fastQC_5_clean) + guides(colour = FALSE)
fastqcReports::plotSequenceQualities(fastQC_0_raw) + guides(colour = FALSE)
fastqcReports::plotSequenceQualities(fastQC_1_trim) + guides(colour = FALSE)
fastqcReports::plotSequenceQualities(fastQC_2_rRNArem) + guides(colour = FALSE)
fastqcReports::plotSequenceQualities(fastQC_3_aligned) + guides(colour = FALSE)
fastqcReports::plotSequenceQualities(fastQC_4_dedup) + guides(colour = FALSE)
fastqcReports::plotSequenceQualities(fastQC_5_clean) + guides(colour = FALSE)
fastqcReports::plotGcContent(fastQC_0_raw) + guides(colour = FALSE)
fastqcReports::plotGcContent(fastQC_1_trim) + guides(colour = FALSE)
fastqcReports::plotGcContent(fastQC_2_rRNArem) + guides(colour = FALSE)
fastqcReports::plotGcContent(fastQC_3_aligned) + guides(colour = FALSE)
fastqcReports::plotGcContent(fastQC_4_dedup) + guides(colour = FALSE)
fastqcReports::plotGcContent(fastQC_5_clean) + guides(colour = FALSE)